1 Introduction
Before building a TNA model, your raw event data needs to be reshaped into sequences. The prepare_data() function does this in one call. For most data, the call takes three arguments and you are done.
This tutorial has two parts:
- Quick Start: the three-argument call that covers 90% of use cases. Start here.
- Detailed Guide: session splitting, time parsing, output anatomy, edge cases. Only needed when the defaults don’t fit your data.
Other tutorials:
- TNA Main Tutorial: building and analyzing a TNA model
- TNA Group Analysis: group comparisons, permutation testing, bootstrapping
- TNA Clustering: data-driven clustering of sequences
- TNA Compare: comparing two TNA models numerically
1.1 Installation
Install the released version:
install.packages("tna")
Or the development version:
# install.packages("remotes")
remotes::install_github("sonsoleslp/tna")
2 Quick Start
2.1 Load your data
Your data should be in long format: one row per event, with columns for what happened, who did it, and when.
We use the built-in dataset as an example:
# Load the package so that the dataset and prepare_data() are available
library(tna)
# Built-in dataset: coded collaborative regulation behaviors
data("group_regulation_long")
group_regulation_long
The columns that matter for prepare_data():
- Action — what happened (the behavioral state). These become network nodes.
- Actor — who did it (participant ID). One sequence per actor.
- Time — when it happened (timestamp). Used for sorting and session splitting.
Everything else (Achiever, Group, Course) is kept automatically as metadata.
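As a minimal sketch of the expected shape, the toy data frame below mimics the three key columns plus one metadata column. The object name toy_events and all values are made up for illustration; they are not part of the tna package or its built-in dataset.

```r
# Hypothetical toy data in long format: one row per event.
toy_events <- data.frame(
  Actor    = c(1, 1, 1, 2, 2),
  Action   = c("plan", "discuss", "consensus", "emotion", "plan"),
  Time     = as.POSIXct(
    c("2025-01-01 10:00:00", "2025-01-01 10:02:30", "2025-01-01 10:05:10",
      "2025-01-01 10:01:00", "2025-01-01 10:04:00"),
    tz = "UTC"
  ),
  Achiever = c("High", "High", "High", "Low", "Low"),  # metadata column
  stringsAsFactors = FALSE
)

# Each actor contributes several rows, which is what makes the table "long".
events_per_actor <- table(toy_events$Actor)
print(events_per_actor)
```

A data frame shaped like this is what the action, actor, and time arguments are meant to describe.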
2.2 Prepare the data
# Three arguments: action, actor, time. That's it.
pd <- prepare_data(
group_regulation_long,
action = "Action",
actor = "Actor",
time = "Time"
)
That’s the whole call. Events are sorted by time within each actor, split into sessions when gaps exceed 15 minutes, and pivoted into wide format. Metadata columns are preserved automatically.
2.3 Build a model
# Build a TNA model from the prepared data
model <- tna(pd)
plot(model)
2.4 Group comparisons
Any column you did not assign to action, actor, or time is preserved as metadata. You can use it for group comparisons without any extra work:
# The Achiever column was preserved automatically
group_models <- group_tna(pd, group = "Achiever")
plot(group_models)
After preparing, you can inspect the statistics to see how many sequences were created, how many actors, sequence lengths, and the time range:
pd$statistics
$total_sessions
[1] 2000
$total_actions
[1] 27533
$max_sequence_length
[1] 26
$unique_users
[1] 2000
$sessions_per_user
# A tibble: 2,000 × 2
Actor n_sessions
<int> <int>
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
# ℹ 1,990 more rows
$actions_per_session
# A tibble: 2,000 × 2
.session_id n_actions
<chr> <int>
1 1010 session1 26
2 1015 session1 26
3 1030 session1 26
4 1092 session1 26
5 1106 session1 26
6 1107 session1 26
7 1153 session1 26
8 1184 session1 26
9 1209 session1 26
10 1267 session1 26
# ℹ 1,990 more rows
$time_range
[1] "2025-01-01 10:01:16 EET" "2025-01-01 15:03:20 EET"
If these numbers look reasonable, you are ready to go.
That covers the typical workflow. If the defaults work for your data, you can stop here and move on to the main tutorial.
3 Detailed Guide (Not Usually Needed)
The rest of this tutorial covers situations where the defaults don’t fit: custom session thresholds, unusual timestamp formats, the full output structure, and troubleshooting. Read this when you run into a specific issue.
3.1 The Full Function Signature
prepare_data(
data,
action,
actor = NULL,
time = NULL,
order = NULL,
time_threshold = 900,
custom_format = NULL,
is_unix_time = FALSE,
unix_time_unit = "seconds",
unused_fn = NULL
)
| Argument | Default | What It Does |
|---|---|---|
| data | (required) | Raw event data in long format |
| action | (required) | Column with events/states (become network nodes) |
| actor | NULL | Column with participant IDs (one sequence per actor) |
| time | NULL | Column with timestamps (sorting + session splitting) |
| order | NULL | Column for tiebreaking same-timestamp events |
| time_threshold | 900 | Gap in seconds that starts a new session (default: 15 min) |
| custom_format | NULL | strptime format for unusual timestamps |
| is_unix_time | FALSE | Force Unix timestamp interpretation |
| unix_time_unit | "seconds" | Unit for Unix timestamps |
| unused_fn | NULL | Aggregation function for metadata during pivot |
Only data and action are required. But without actor, the entire dataset becomes one sequence — you lose the ability to do permutation testing, bootstrapping, or group comparisons. Always include actor unless your data genuinely has a single observation stream.
3.2 Session Splitting with time_threshold
When time is provided, prepare_data() splits each actor’s events into sessions. If two consecutive events are more than time_threshold seconds apart, a new session starts. The default is 900 seconds (15 minutes).
How it works for an actor with events at 9:00, 9:03, 9:07, 10:30, 10:32 with a 15-minute threshold:
- 9:00 → 9:03 (3 min gap) — same session
- 9:03 → 9:07 (4 min gap) — same session
- 9:07 → 10:30 (83 min gap) — new session
- 10:30 → 10:32 (2 min gap) — same session
Result: two sequences — (9:00, 9:03, 9:07) and (10:30, 10:32).
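The rule above can be sketched in a few lines of base R. This is only an illustration of the gap-threshold idea, not the package's internal implementation; the variable names are chosen for this example.

```r
# Event times for one hypothetical actor (the same times as in the example).
times <- as.POSIXct(
  paste("2025-01-01", c("09:00:00", "09:03:00", "09:07:00",
                        "10:30:00", "10:32:00")),
  tz = "UTC"
)
threshold <- 900  # seconds, i.e. 15 minutes

# Gap in seconds between consecutive events.
gaps <- as.numeric(diff(times), units = "secs")

# The first event opens session 1; every gap above the threshold opens a new one.
session_nr <- cumsum(c(TRUE, gaps > threshold))

# Group the event times by session number.
split(format(times, "%H:%M"), session_nr)
```

Only the 83-minute gap exceeds 900 seconds, so the five events fall into two groups.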
Changing the threshold affects how many sequences you get:
# 5-minute gaps start a new session → more, shorter sessions
pd_5min <- prepare_data(
group_regulation_long,
action = "Action", actor = "Actor", time = "Time",
time_threshold = 300
)
# 1-hour gaps start a new session → fewer, longer sessions
pd_1hr <- prepare_data(
group_regulation_long,
action = "Action", actor = "Actor", time = "Time",
time_threshold = 3600
)
# Compare
threshold_comparison <- data.frame(
Threshold = c("300s (5 min)", "900s (15 min, default)", "3600s (1 hour)"),
Sessions = c(
pd_5min$statistics$total_sessions,
pd$statistics$total_sessions,
pd_1hr$statistics$total_sessions
)
)
threshold_comparison
Suggested thresholds by data type:
- Chat or messaging data: 2–5 min. Conversations have rapid exchanges.
- LMS logs: 10–30 min. Students pause to read or think. The 15-minute default works well.
- Collaborative coding: 15–60 min. Longer focused work sessions.
- Diary studies: hours or days. Each entry is a separate session.
If unsure, try a few values and check $statistics. Sessions should be long enough to contain meaningful transitions (not just 1–2 events) but short enough that unrelated events aren’t chained together.
3.3 The order Argument
Some logging systems record multiple events with the exact same timestamp. The order argument provides a tiebreaker — a numeric column (step number, line number) that determines which event comes first among same-timestamp events.
# Use step_number to break ties when events share the same timestamp
prepared <- prepare_data(
my_data,
action = "Action", actor = "Actor", time = "Time",
order = "step_number"
)
When both time and order are given, events are sorted by time first, with ties broken by order. You can also use order without time (sorts by that column alone, no session splitting), but this is rarely needed because data is usually already in the right row order.
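The "time first, ties broken by order" behavior corresponds to a two-key sort. The base R sketch below shows the idea on made-up events; it is not taken from the package source, and the data frame is hypothetical.

```r
# Hypothetical events: two of them share the same timestamp.
events <- data.frame(
  Action      = c("discuss", "plan", "monitor"),
  Time        = as.POSIXct(c("2025-01-01 09:00:05", "2025-01-01 09:00:00",
                             "2025-01-01 09:00:05"), tz = "UTC"),
  step_number = c(3, 1, 2),
  stringsAsFactors = FALSE
)

# Sort by time first; step_number only decides among equal timestamps.
sorted <- events[order(events$Time, events$step_number), ]
sorted$Action
```

Without the second key, the relative order of the two 09:00:05 events would depend on their row order in the input.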
3.4 The Output Object
prepare_data() returns a tna_data object with five components:
| Component | What It Contains | When Present |
|---|---|---|
| $long_data | Original data + .standardized_time, .session_nr, .session_id, .sequence | Always |
| $sequence_data | Wide format: rows = sequences, columns = time positions | Always |
| $meta_data | .session_id + all columns not assigned to action/actor/time/order | Always |
| $time_data | Wide timestamps aligned with $sequence_data | Only with time |
| $statistics | Session counts, user counts, actions per session, time range | Always |
3.4.1 Sequence data
Wide format: each row is one sequence, each column is a time position. Shorter sequences are padded with NA.
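To make the NA padding concrete, here is a small base R sketch that stacks sequences of unequal length into a wide matrix. The session names and actions are invented, and the package may build its wide table differently.

```r
# Hypothetical sessions of unequal length.
sessions <- list(
  s1 = c("plan", "discuss", "consensus"),
  s2 = c("emotion", "plan")
)

# Indexing past the end of a vector yields NA, which pads the short sequence.
max_len <- max(lengths(sessions))
wide <- t(sapply(sessions, function(s) s[seq_len(max_len)]))
colnames(wide) <- paste0("Action_T", seq_len(max_len))
wide
```

The shorter session s2 ends with NA in the last column, mirroring the layout of the real sequence data.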
# First 5 rows, first 10 columns
pd$sequence_data[1:5, 1:10]
3.4.2 Metadata
One row per sequence. Contains .session_id and every column not used as action/actor/time/order. This is how group_tna(pd, group = "Achiever") knows which sequences belong to which group.
pd$meta_data
3.4.3 Long data
The original events, sorted and annotated with session information:
pd$long_data
3.4.4 Time data
Wide-format timestamps aligned with $sequence_data. Each cell is the timestamp of the corresponding event. NULL when time is not provided.
pd$time_data[1:5, 1:8]
print() method
You can inspect components without $ notation:
# Print sequence data
print(pd, data = "sequence")
# A tibble: 2,000 × 26
Action_T1 Action_T2 Action_T3 Action_T4 Action_T5 Action_T6 Action_T7
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 cohesion consensus discuss synthesis adapt consensus plan
2 emotion cohesion discuss synthesis <NA> <NA> <NA>
3 plan consensus plan <NA> <NA> <NA> <NA>
4 discuss discuss consensus plan cohesion consensus discuss
5 cohesion consensus plan plan monitor plan consensus
6 discuss adapt cohesion consensus discuss emotion cohesion
7 discuss emotion cohesion consensus coregulate coregulate plan
8 cohesion plan consensus plan consensus discuss discuss
9 emotion cohesion emotion plan monitor discuss emotion
10 emotion cohesion consensus plan plan plan plan
# ℹ 1,990 more rows
# ℹ 19 more variables: Action_T8 <chr>, Action_T9 <chr>, Action_T10 <chr>,
# Action_T11 <chr>, Action_T12 <chr>, Action_T13 <chr>, Action_T14 <chr>,
# Action_T15 <chr>, Action_T16 <chr>, Action_T17 <chr>, Action_T18 <chr>,
# Action_T19 <chr>, Action_T20 <chr>, Action_T21 <chr>, Action_T22 <chr>,
# Action_T23 <chr>, Action_T24 <chr>, Action_T25 <chr>, Action_T26 <chr>
# Print metadata
print(pd, data = "meta")
# A tibble: 2,000 × 7
.session_id Actor Achiever Group Course Time .session_nr
<chr> <int> <chr> <dbl> <chr> <dttm> <int>
1 1 session1 1 High 1 A 2025-01-01 10:27:07 1
2 10 session1 10 High 1 A 2025-01-01 10:23:45 1
3 100 session1 100 High 10 A 2025-01-01 12:11:50 1
4 1000 session1 1000 High 100 B 2025-01-01 11:12:00 1
5 1001 session1 1001 Low 101 B 2025-01-01 11:18:40 1
6 1002 session1 1002 Low 101 B 2025-01-01 11:18:53 1
7 1003 session1 1003 Low 101 B 2025-01-01 11:18:05 1
8 1004 session1 1004 Low 101 B 2025-01-01 11:22:26 1
9 1005 session1 1005 Low 101 B 2025-01-01 11:22:31 1
10 1006 session1 1006 Low 101 B 2025-01-01 11:15:23 1
# ℹ 1,990 more rows
3.5 Time Parsing
The time parser auto-detects 52 timestamp formats. You usually don’t need to do anything — just pass the column name.
Date + time (YYYY-MM-DD)
| Format | Example |
|---|---|
| %Y-%m-%d %H:%M:%S | 2023-01-09 18:44:00 |
| %Y-%m-%d %H:%M | 2023-01-09 18:44 |
| %Y/%m/%d %H:%M:%S | 2023/01/09 18:44:00 |
| %Y/%m/%d %H:%M | 2023/01/09 18:44 |
| %Y.%m.%d %H:%M:%S | 2023.01.09 18:44:00 |
| %Y.%m.%d %H:%M | 2023.01.09 18:44 |
ISO 8601 (T separator)
| Format | Example |
|---|---|
| %Y-%m-%dT%H:%M:%S | 2023-01-09T18:44:00 |
| %Y-%m-%dT%H:%M | 2023-01-09T18:44 |
| %Y-%m-%dT%H:%M:%OS | 2023-01-09T18:44:00.123 |
With timezone offset
| Format | Example |
|---|---|
| %Y-%m-%d %H:%M:%S%z | 2023-01-09 18:44:00+0100 |
| %Y-%m-%d %H:%M%z | 2023-01-09 18:44+0100 |
| %Y-%m-%d %H:%M:%S %z | 2023-01-09 18:44:00 +0100 |
| %Y-%m-%d %H:%M %z | 2023-01-09 18:44 +0100 |
Compact (no separators)
| Format | Example |
|---|---|
| %Y%m%d%H%M%S | 20230109184400 |
| %Y%m%d%H%M | 202301091844 |
European (DD-MM-YYYY)
| Format | Example |
|---|---|
| %d-%m-%Y %H:%M:%S | 09-01-2023 18:44:00 |
| %d-%m-%Y %H:%M | 09-01-2023 18:44 |
| %d/%m/%Y %H:%M:%S | 09/01/2023 18:44:00 |
| %d/%m/%Y %H:%M | 09/01/2023 18:44 |
| %d.%m.%Y %H:%M:%S | 09.01.2023 18:44:00 |
| %d.%m.%Y %H:%M | 09.01.2023 18:44 |
| %d-%m-%YT%H:%M:%S | 09-01-2023T18:44:00 |
| %d-%m-%YT%H:%M | 09-01-2023T18:44 |
US (MM-DD-YYYY)
| Format | Example |
|---|---|
| %m-%d-%Y %H:%M:%S | 01-09-2023 18:44:00 |
| %m-%d-%Y %H:%M | 01-09-2023 18:44 |
| %m/%d/%Y %H:%M:%S | 01/09/2023 18:44:00 |
| %m/%d/%Y %H:%M | 01/09/2023 18:44 |
| %m.%d.%Y %H:%M:%S | 01.09.2023 18:44:00 |
| %m.%d.%Y %H:%M | 01.09.2023 18:44 |
| %m-%d-%YT%H:%M:%S | 01-09-2023T18:44:00 |
| %m-%d-%YT%H:%M | 01-09-2023T18:44 |
With month names
| Format | Example |
|---|---|
| %d %b %Y %H:%M:%S | 09 Jan 2023 18:44:00 |
| %d %b %Y %H:%M | 09 Jan 2023 18:44 |
| %d %B %Y %H:%M:%S | 09 January 2023 18:44:00 |
| %d %B %Y %H:%M | 09 January 2023 18:44 |
| %b %d %Y %H:%M:%S | Jan 09 2023 18:44:00 |
| %b %d %Y %H:%M | Jan 09 2023 18:44 |
| %B %d %Y %H:%M:%S | January 09 2023 18:44:00 |
| %B %d %Y %H:%M | January 09 2023 18:44 |
Date only
| Format | Example |
|---|---|
| %Y-%m-%d | 2023-01-09 |
| %Y/%m/%d | 2023/01/09 |
| %Y.%m.%d | 2023.01.09 |
| %d-%m-%Y | 09-01-2023 |
| %d/%m/%Y | 09/01/2023 |
| %d.%m.%Y | 09.01.2023 |
| %m-%d-%Y | 01-09-2023 |
| %m/%d/%Y | 01/09/2023 |
| %m.%d.%Y | 01.09.2023 |
| %d %b %Y | 09 Jan 2023 |
| %d %B %Y | 09 January 2023 |
| %b %d %Y | Jan 09 2023 |
| %B %d %Y | January 09 2023 |
Unix timestamps (numeric seconds, milliseconds, or microseconds since epoch) are also detected automatically.
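For readers unfamiliar with Unix time, the base R sketch below shows what converting seconds and milliseconds since the epoch amounts to. The helper to_time() is hypothetical and written only for this example; it is not how the package detects or converts units.

```r
# Hypothetical Unix timestamps for the same instant in two units.
ts_seconds      <- 1673289840     # seconds since 1970-01-01 00:00:00 UTC
ts_milliseconds <- 1673289840000  # milliseconds since 1970-01-01 00:00:00 UTC

# Illustrative helper: scale to seconds, then convert to a date-time.
to_time <- function(x, per_second = 1) {
  as.POSIXct(x / per_second, origin = "1970-01-01", tz = "UTC")
}

from_s  <- to_time(ts_seconds)
from_ms <- to_time(ts_milliseconds, per_second = 1000)
format(from_s, "%Y-%m-%d %H:%M:%S")
```

Both values resolve to the same instant, 2023-01-09 18:44:00 UTC, once the unit is accounted for.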
For unusual formats not covered by auto-detection:
# Custom format: "15-Mar-2024_14h30m"
prepared <- prepare_data(
data, action = "Action", actor = "Actor",
time = "Time", custom_format = "%d-%b-%Y_%Hh%Mm"
)
For Unix timestamps (numeric seconds or milliseconds since epoch):
# Time column contains Unix timestamps in milliseconds
prepared <- prepare_data(
data, action = "Action", actor = "Actor",
time = "Time", is_unix_time = TRUE, unix_time_unit = "milliseconds"
)
A date such as 03/04/2024 could be March 4 (US) or April 3 (European). The parser tries US format first. If your data uses European dates and the day values never exceed 12, the two readings cannot be told apart automatically, so use custom_format:
# Force European date interpretation
prepared <- prepare_data(
data, action = "Action", actor = "Actor",
time = "Time", custom_format = "%d/%m/%Y %H:%M:%S"
)
3.6 The unused_fn Argument
During the pivot from long to wide, metadata columns are collapsed from multiple rows per actor to one row per sequence. By default, the first value is taken. This works when metadata is constant within a session (e.g., achievement level doesn’t change between events).
If metadata varies within a session (e.g., a running score):
# Take the last value per session
prepared <- prepare_data(
data, action = "Action", actor = "Actor", time = "Time",
unused_fn = dplyr::last
)
For most use cases (grouping variables, demographics), the default is correct.
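The difference between taking the first and the last value can be seen in a small base R sketch. The session labels and scores are made up, and tapply() stands in for whatever aggregation the pivot performs internally.

```r
# Hypothetical events with a score that changes within each session.
events <- data.frame(
  session = c("a", "a", "a", "b", "b"),
  score   = c(10, 15, 20, 5, 8),
  stringsAsFactors = FALSE
)

# Collapse to one value per session in two different ways.
first_score <- tapply(events$score, events$session, function(x) x[1])
last_score  <- tapply(events$score, events$session, function(x) x[length(x)])

rbind(first = first_score, last = last_score)
```

For a column that is constant within a session, both rows would be identical, which is why the choice only matters for varying metadata.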
3.7 Troubleshooting
One giant sequence: You forgot actor. Without it, the entire dataset becomes one sequence.
Too many / too few sessions: Adjust time_threshold. Check $statistics to see if the session count looks right.
Wrong event order: Without time, events are read in row order. If your data isn’t sorted, provide time.
Time parsing errors: Use custom_format to specify the format explicitly.
Very short sequences: Sequences of length 1 contribute zero transitions. Increase time_threshold to merge micro-sessions, or filter them out.
NA in the action column: Remove or impute before calling prepare_data().
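For the time parsing item above, it can help to see how one string yields two different dates depending on the format. The sketch uses base R's as.Date() with strptime-style formats and an invented input string; it does not call the package's parser.

```r
# The same string parsed under two strptime-style formats.
raw <- "03/04/2024"

as_us       <- as.Date(raw, format = "%m/%d/%Y")  # month first
as_european <- as.Date(raw, format = "%d/%m/%Y")  # day first

c(US = format(as_us), European = format(as_european))
```

Both calls succeed without a warning, so a wrong guess does not show up as a parsing error and is only visible when you inspect the parsed times.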
$long_data
When something looks wrong, check the annotated long data:
# Check a specific actor's events
subset(pd$long_data, Actor == "some_actor_id")
# Look at session boundaries
library(dplyr)
pd$long_data |>
group_by(.session_id) |>
summarize(n_events = n(), start = min(.standardized_time),
end = max(.standardized_time))
4 Quick Reference
4.1 Decision Guide
flowchart LR
  A["Do you have multiple participants?"]
  A -->|Yes| B["use actor"]
  A -->|No| C["omit actor (single observation stream)"]
flowchart LR
  D["Do you have timestamps?"]
  D -->|Yes| E["use time (sorting + session splitting)"]
  D -->|No| F["make sure rows are already in the right order"]
  E --> G["Are the default 15-min sessions appropriate?"]
  G -->|Yes| H["done!"]
  G -->|No| I["set time_threshold (in seconds)"]
  E --> J["Can multiple events share the same timestamp?"]
  J -->|Yes| K["also use order"]
  J -->|No| L["time alone is fine"]
flowchart LR
  M["Do you need group comparisons later?"]
  M -->|Yes| N["keep grouping variable as a column (preserved automatically)"]
  M -->|No| O["no extra steps"]
References
- Saqr, M., López-Pernas, S., Törmänen, T., Kaliisa, R., Misiejuk, K., & Tikka, S. (2025). Transition Network Analysis: A Novel Framework for Modeling, Visualizing, and Identifying the Temporal Patterns of Learners and Learning Processes. In Proceedings of the 15th International Learning Analytics and Knowledge Conference (LAK ’25) (pp. 351–361). ACM. https://doi.org/10.1145/3706468.3706513
- Tikka, S., López-Pernas, S., & Saqr, M. (2025). tna: An R Package for Transition Network Analysis. Applied Psychological Measurement. https://doi.org/10.1177/01466216251348840
- Package website: https://sonsoles.me/tna/
Citation
@misc{saqr2026,
author = {Saqr, Mohammed and López-Pernas, Sonsoles},
title = {TNA {Data} {Preparation:} {A} {Comprehensive} {Guide} to
`Prepare\_data()`},
date = {2026-02-06},
url = {https://sonsoleslp.github.io/posts/tna-data/},
langid = {en}
}